Abstract

The following outlier hypothesis testing problem is studied in a universal setting. Vector observations are collected 
each with $M \geq 3$ coordinates. When a given coordinate is the outlier, the observations in that coordinate are
assumed to be distributed according to the "outlier" distribution, distinct from the common "typical" distribution 
governing the observations in all the other coordinates. Nothing is known about the outlier and the typical distributions 
except that they are distinct and have full supports. The goal is to design a universal test to best discern the outlier 
coordinate. Applications of outlier detection include event detection and environment monitoring in sensor networks, 
understanding of visual search in humans and animals, and fraud detection and anomaly detection in big data. A 
universal test based on the generalized likelihood principle is proposed and is shown to be universally exponentially 
consistent, and a single-letter characterization of the error exponent achievable by the test is derived. It is shown 
that as the number of coordinates approaches infinity, our universal test is asymptotically efficient. Specifically, it 
achieves a limiting error exponent that is equal to the largest achievable error exponent when the outlier and typical 
distributions are both known. The results are also generalized to the case with multiple outliers wherein the number 
of outliers is fixed and known at the outset. 

Keywords: anomaly detection, big data, classification, detection and estimation, error exponent, fraud detection,
generalized likelihood principle, outlier detection, outlier hypothesis testing, universal consistency, universally
exponential consistency

I. Introduction 

We consider the following inference problem, which we term outlier hypothesis testing. In vector observations each 
with $M \geq 3$ coordinates, it is assumed that there is one outlier coordinate. Specifically, when the i-th coordinate
is the outlier, the distribution governing the observations in that coordinate is assumed to be distinct from that 
governing the observations in all the other coordinates, which all come from the same "typical" distribution. The 
goal is to design a test to decide which coordinate is the outlier. We will be interested in the universal setting of
this problem, where the test has to perform well without any prior knowledge of the outlier and typical distributions 
except that they must be different and have full supports. 

It is to be noted that our problem of outlier hypothesis testing is distinct from that of statistical outlier detection
[1], [2]. In outlier detection, the goal is to efficiently winnow out a few outlier observations from a single sequence of
observations. The outlier observations are assumed to follow a different generating mechanism from that governing
the normal observations. The main differences between this outlier detection problem and our outlier hypothesis
testing problem are: (i) in the former problem, the outlier observations constitute a much smaller fraction of the
entire observations than in the latter problem (a fraction $1/M$ of all observations), and (ii) these outlier observations can
be arbitrarily spread out among all observations in the outlier detection problem, whereas all the outlier observations
are concentrated in one coordinate in the outlier hypothesis testing problem.

This work was supported by the Air Force Office of Scientific Research (AFOSR) under the Grant FA9550-10-1-0458 through the University 
of Illinois at Urbana-Champaign, by the U.S. Defense Threat Reduction Agency through subcontract 147755 at the University of Illinois from 
prime award HDTRA1-10-1-0086, and by the National Science Foundation under Grant NSF CCF 11-11342.
Statistical outlier detection is typically used to preprocess large data sets, to obtain clean data that is used for
some purpose such as inference and control. The outlier hypothesis testing problem that we study here arises in
event detection and environment monitoring in sensor networks [3], understanding of visual search in humans and
animals [4], fraud and anomaly detection [5], [6] in large data sets, and optimal search and target tracking [7].

Universal outlier hypothesis testing is part of a broader class of composite hypothesis testing problems in which 
there is uncertainty in the probabilistic laws associated with some or all of the hypotheses. To solve these problems, 
a popular philosophy of test design is the generalized likelihood principle [8], [9]. For example, in the
simple-versus-composite case, the goal is to make a decision in favor of either the null distribution, which is known
to the tester, or a family of alternative distributions. A fundamental result concerning the asymptotic optimality
of the generalized likelihood ratio test in this case was shown in [10]. When some uncertainty is present in the
null hypothesis as well, i.e., the composite-versus-composite setting, the optimality of the generalized likelihood
ratio test has been examined under various conditions [9]. Universal outlier hypothesis testing is closely related to
homogeneity testing and classification [11]-[15], both of which can be formulated as composite hypothesis testing
problems. In homogeneity testing, one wishes to decide whether or not the two samples come from the same 
probabilistic law. In classification problems, a set of test data is classified based on another set of pre-acquired 
training data containing observations whose class membership is known. In [14], [15], a classifier based on the
generalized likelihood principle was shown to be optimal under the asymptotic Neyman-Pearson criterion. 

A metric that is commonly used to quantify the performance of a universal test is consistency. A universal test is 
consistent if the error probability approaches zero as the sample size goes to infinity, and is exponentially consistent 
if the decay is exponential with sample size. For binary hypothesis testing based on memoryless observations, one 
can obtain a universally consistent test based on the empirical distribution of the observations, when the distribution 
governing the observations under the null hypothesis is known [10], [16]. However, it is impossible to achieve
universally exponential consistency without any prior knowledge of the distribution governing the observations 
under the alternative hypothesis. Specifically, for any universal test, there exists a distribution for the alternative 
hypothesis that will render the exponent for the worst-case error probability to be zero. The same is true for 
homogeneity testing in that it is impossible to achieve exponential consistency for every distribution pair governing 
the two samples upon which homogeneity is tested p?) , p3) . 

In our outlier hypothesis testing problem, we have neither any information regarding the outlier and the typical 
distributions, nor any training data to learn these distributions before the detection is performed. The only information
we are given is that there is exactly one outlier coordinate in the observation vector, while the rest of the coordinates 
come from the same typical distribution. In other words, the only prior knowledge we have about the hypotheses is 
in terms of the structure of the joint distribution of the observation vector under each hypothesis. As a consequence 
of the aforementioned negative results in binary hypothesis testing and homogeneity testing, it is not clear at the 
outset that a universally exponentially consistent test should exist, and even if it does, it is not clear what its structure 
and performance should be. 

Our main finding in this paper is that for our outlier hypothesis testing, one can construct universal tests that 
are far more efficient than for the other inference problems mentioned previously, such as homogeneity testing or 
classification. This seems quite surprising to us, as no training data is required in our test construction. Having 
said that, we note that we do assume that it is known at the outset that the outlier is indeed present, i.e., not all 
the coordinates are identically distributed. This critical assumption makes our universal outlier hypothesis testing 
unique from the other aforementioned inference problems considered in universal settings. This prior knowledge 
that the outlier is present is quite natural in several applications such as search problems and target tracking Q, 
(jT). Similarly for detecting credit card fraud, algorithms for outlier hypothesis testing can be executed on a set of
customers of a specific store that has detected a fraudulent transaction in its books. For other applications such as
event detection, environment monitoring Q and anomaly detection it should be possible to quickly detect the 
presence of the outlier (event, anomaly) in the systems by looking at certain global attributes of the entire data. 
Then, our algorithms for outlier hypothesis testing can be used to efficiently pinpoint where the outlier is. Our
findings suggest that such prior knowledge of the presence of an outlier, when it can be incorporated, can be extremely
useful for such applications.

Our technical contributions are as follows. First, we propose a universal test that is based on empirical distributions 
of the coordinates of the vector observations. Our test follows the same principle as that underlying the generalized 
likelihood ratio test [8], [9]. When only the typical distribution is known, we show that our test achieves the same
optimal error exponent as in the case where both the typical and outlier distributions are known. We then
consider the completely universal setting where both the typical and outlier distributions are unknown, and prove
that our test is universally exponentially consistent for all $M \geq 3$. We also establish that as $M$ goes to infinity, the
error exponent achievable by our universal test converges to the optimal error exponent corresponding to the case
where both the typical and outlier distributions are known. Thus our test is universally exponentially consistent
and asymptotically optimal as $M \to \infty$. Lastly, we also show that our results generalize to the case with multiple
outliers wherein the number of outliers is fixed and known at the outset.

II. Preliminaries 

Throughout the paper, random variables are denoted by capital letters, and their realizations are denoted by the 
corresponding lower-case letters. All random variables are assumed to take values in finite sets, and all logarithms 
are the natural ones. 

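For a finite set $\mathcal{Y}$, let $\mathcal{Y}^m$ denote the $m$-fold Cartesian product of $\mathcal{Y}$, and $\mathcal{P}(\mathcal{Y})$ denote the set of all probability
mass functions (pmfs) on $\mathcal{Y}$. The empirical distribution of a sequence $y^m = (y_1, \ldots, y_m) \in \mathcal{Y}^m$, denoted by
$\gamma = \gamma_{y^m} \in \mathcal{P}(\mathcal{Y})$, is defined as

$$\gamma(y) \triangleq \frac{1}{m} \big| \{ k = 1, \ldots, m : y_k = y \} \big|, \quad y \in \mathcal{Y}.$$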

Consider $n$ independent and identically distributed (i.i.d.) vector observations, each of which has $M \geq 3$
independent coordinates. We denote the $i$-th coordinate of the $k$-th observation by $Y_k^{(i)} \in \mathcal{Y}$. It is assumed that only
one coordinate is the "outlier," i.e., the observations in that coordinate are uniquely distributed (i.i.d.) according
to the "outlier" distribution $\mu \in \mathcal{P}(\mathcal{Y})$, while all the other coordinates are commonly distributed according to the
"typical" distribution $\pi \in \mathcal{P}(\mathcal{Y})$. Nothing is known about $\mu$ and $\pi$ except that $\mu \neq \pi$, and that each of them has
full support. Clearly, if $M = 2$, either coordinate can be considered as an outlier; hence, it becomes degenerate to
consider outlier detection in this case.

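When the $i$-th coordinate is the outlier, the joint distribution of all the observations is

$$p_i\big(y^{Mn}\big) = p_i\big(y^{(1)}, \ldots, y^{(M)}\big) = \prod_{k=1}^{n} \Big\{ \mu\big(y_k^{(i)}\big) \prod_{j \neq i} \pi\big(y_k^{(j)}\big) \Big\},$$

where $y^{(i)} = \big(y_1^{(i)}, \ldots, y_n^{(i)}\big)$, $i = 1, \ldots, M$.

The test for the outlier coordinate is done based on a universal rule $\delta : \mathcal{Y}^{Mn} \to \{1, \ldots, M\}$. In particular, the test
$\delta$ is not allowed to depend on $(\mu, \pi)$.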

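For a universal test, the maximal error probability, which will be a function of the test and $(\mu, \pi)$, is

$$e(\delta, (\mu, \pi)) \triangleq \max_{i = 1, \ldots, M} \sum_{y^{Mn} :\, \delta(y^{Mn}) \neq i} p_i\big(y^{Mn}\big),$$

and the corresponding error exponent is defined as

$$\alpha(\delta, (\mu, \pi)) \triangleq \lim_{n \to \infty} -\frac{1}{n} \log e(\delta, (\mu, \pi)).$$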

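The following technical facts will be useful; their derivations can be found in [17]. Consider random variables
$Y^n$ which are i.i.d. according to $p \in \mathcal{P}(\mathcal{Y})$. Let $y^n \in \mathcal{Y}^n$ be a sequence with an empirical distribution $\gamma \in \mathcal{P}(\mathcal{Y})$.
It follows that the probability of such a sequence $y^n$, under $p$ and under the i.i.d. assumption, is

$$p(y^n) = \exp\{ -n ( D(\gamma \| p) + H(\gamma) ) \}, \tag{1}$$

where $D(\gamma \| p)$ and $H(\gamma)$ are the relative entropy of $\gamma$ and $p$, and the entropy of $\gamma$, defined as

$$D(\gamma \| p) \triangleq \sum_{y \in \mathcal{Y}} \gamma(y) \log \frac{\gamma(y)}{p(y)}$$

and

$$H(\gamma) \triangleq - \sum_{y \in \mathcal{Y}} \gamma(y) \log \gamma(y),$$

respectively. Consequently, it holds that for each $y^n$, the pmf $p$ that maximizes $p(y^n)$ is $p = \gamma$, and the associated
maximal probability of $y^n$ is

$$\gamma(y^n) = \exp\{ -n H(\gamma) \}. \tag{2}$$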
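As a quick sanity check of identities (1) and (2), the following minimal Python sketch computes the empirical distribution, entropy, and relative entropy of a short sequence and compares both sides numerically. The helper names (`empirical_distribution`, `entropy`, `kl_divergence`, `sequence_probability`), the alphabet, the pmf `p`, and the sequence `y` are illustrative assumptions, not part of the paper.

```python
import math
from collections import Counter


def empirical_distribution(seq, alphabet):
    """Empirical pmf (type) of seq, returned as a dict over the alphabet."""
    counts = Counter(seq)
    return {a: counts[a] / len(seq) for a in alphabet}


def entropy(gamma):
    """Entropy H(gamma) in nats, with the convention 0 log 0 = 0."""
    return -sum(g * math.log(g) for g in gamma.values() if g > 0)


def kl_divergence(gamma, p):
    """Relative entropy D(gamma || p) in nats; p is assumed to have full support."""
    return sum(g * math.log(g / p[a]) for a, g in gamma.items() if g > 0)


def sequence_probability(seq, p):
    """Probability of seq when its entries are drawn i.i.d. from p."""
    return math.prod(p[a] for a in seq)


# Illustrative data (assumed, not taken from the paper).
alphabet = (0, 1, 2)
p = {0: 0.6, 1: 0.3, 2: 0.1}
y = [0, 1, 0, 2, 0, 1, 1, 0, 2, 0]
n = len(y)
gamma = empirical_distribution(y, alphabet)

lhs_1 = sequence_probability(y, p)                                  # p(y^n)
rhs_1 = math.exp(-n * (kl_divergence(gamma, p) + entropy(gamma)))   # right side of (1)
lhs_2 = sequence_probability(y, gamma)                              # gamma(y^n)
rhs_2 = math.exp(-n * entropy(gamma))                               # right side of (2)
```

Both pairs agree up to floating-point rounding, and `lhs_2` is at least `lhs_1`, in line with the statement that $p = \gamma$ maximizes $p(y^n)$.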

III. Proposed Universal Test 

We now describe our universal test in two setups: when only $\pi$ is known, and when neither $\mu$ nor $\pi$ is known,
respectively. Our test follows the same principle as the generalized likelihood ratio test Q. 

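For each $i = 1, \ldots, M$, denote the empirical distribution of $y^{(i)}$ by $\gamma_i$. Note that the normalized log-likelihood
of $y^{Mn}$ when the $i$-th coordinate is the outlier is

$$\begin{aligned}
\frac{1}{n} \log p_i\big(y^{Mn}\big)
&= -\big[ H(\gamma_i) + D(\gamma_i \| \mu) \big] - (M-1) \Big[ H\Big( \tfrac{\sum_{j \neq i} \gamma_j}{M-1} \Big) + D\Big( \tfrac{\sum_{j \neq i} \gamma_j}{M-1} \Big\| \pi \Big) \Big] && (3) \\
&= -\big[ H(\gamma_i) + D(\gamma_i \| \mu) \big] - \sum_{j \neq i} \big[ H(\gamma_j) + D(\gamma_j \| \pi) \big], && (4)
\end{aligned}$$

for $i = 1, \ldots, M$.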

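When $\pi$ is known and $\mu$ is unknown, we compute the generalized log-likelihood of $y^{Mn}$ by replacing $\mu$ in (4)
with its maximum likelihood (ML) estimate $\hat{\mu}_i \triangleq \gamma_i$, $i = 1, \ldots, M$, as

$$-H(\gamma_i) - \sum_{j \neq i} \big[ H(\gamma_j) + D(\gamma_j \| \pi) \big]. \tag{5}$$

Similarly, when neither $\mu$ nor $\pi$ is known, we compute the generalized log-likelihood of $y^{Mn}$ by replacing the
$\mu$ and $\pi$ in (3) and (4) with their maximum likelihood (ML) estimates $\hat{\mu}_i \triangleq \gamma_i$ and $\hat{\pi}_i \triangleq \frac{\sum_{j \neq i} \gamma_j}{M-1}$, $i = 1, \ldots, M$, as

$$-H(\gamma_i) - \sum_{j \neq i} \Big[ H(\gamma_j) + D\Big( \gamma_j \Big\| \tfrac{\sum_{k \neq i} \gamma_k}{M-1} \Big) \Big]. \tag{6}$$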



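Finally, we decide upon the coordinate with the largest generalized log-likelihood to be the outlier. Since the sum
$\sum_{j=1}^{M} H(\gamma_j)$ does not depend on $i$, maximizing the generalized log-likelihood is the same as minimizing the remaining
sum of relative entropies. Using (5), (6), our universal tests in the two cases can be described respectively as

$$\delta\big(y^{Mn}\big) = \arg\min_{i = 1, \ldots, M} U_i^{\mathrm{typ}}\big(y^{Mn}\big), \tag{7}$$

when only $\pi$ is known, where for each $i = 1, \ldots, M$,

$$U_i^{\mathrm{typ}}\big(y^{Mn}\big) \triangleq \sum_{j \neq i} D(\gamma_j \| \pi), \tag{8}$$

and

$$\delta\big(y^{Mn}\big) = \arg\min_{i = 1, \ldots, M} U_i^{\mathrm{univ}}\big(y^{Mn}\big), \tag{9}$$

when neither $\mu$ nor $\pi$ is known, where for each $i = 1, \ldots, M$,

$$U_i^{\mathrm{univ}}\big(y^{Mn}\big) \triangleq \sum_{j \neq i} D\Big( \gamma_j \Big\| \tfrac{\sum_{k \neq i} \gamma_k}{M-1} \Big). \tag{10}$$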
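The universal test in (9), (10) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the 0-based coordinate indexing, and the synthetic parameters (`M`, `n`, `mu`, `pi`, the seed, and the position of the outlier) are all hypothetical choices made for the example.

```python
import math
import random
from collections import Counter


def empirical(seq, alphabet):
    """Empirical pmf of one coordinate's observations, as a dict over the alphabet."""
    counts = Counter(seq)
    return {a: counts[a] / len(seq) for a in alphabet}


def kl(q, p):
    """Relative entropy D(q || p) in nats, with the convention 0 log 0 = 0."""
    return sum(q[a] * math.log(q[a] / p[a]) for a in q if q[a] > 0)


def universal_statistic(gammas, i, alphabet):
    """U_i^univ in (10): divergences of the other coordinates from their average."""
    others = [g for j, g in enumerate(gammas) if j != i]
    avg = {a: sum(g[a] for g in others) / len(others) for a in alphabet}
    return sum(kl(g, avg) for g in others)


def universal_outlier_test(observations, alphabet):
    """Return the 0-based index of the coordinate minimizing U_i^univ, as in (9).

    observations[i] is the length-n sequence observed in coordinate i.
    """
    gammas = [empirical(seq, alphabet) for seq in observations]
    stats = [universal_statistic(gammas, i, alphabet) for i in range(len(gammas))]
    return min(range(len(stats)), key=stats.__getitem__)


# Synthetic example; every number below is an illustrative assumption.
random.seed(0)
alphabet = (0, 1)
mu, pi = (0.3, 0.7), (0.7, 0.3)
M, n, outlier = 5, 2000, 2
observations = [
    random.choices(alphabet, weights=mu if i == outlier else pi, k=n)
    for i in range(M)
]
decision = universal_outlier_test(observations, alphabet)
```

Note that the test never touches `mu` or `pi`; they are used only to generate the synthetic data. The average in `universal_statistic` always dominates each of the pmfs it is built from, so the relative entropies are finite even when some symbol is never observed in a coordinate.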



IV. Results 



Our first theorem in this section characterizes the optimal exponent for the maximal error probability when both
$\mu$ and $\pi$ are known, and when only $\pi$ is known.

Theorem 1. For every $M \geq 3$, when $\mu$ and $\pi$ are both known, the optimal exponent for the maximal error
probability is equal to

$$2B(\mu, \pi), \tag{11}$$

where $B(\mu, \pi)$ is the Bhattacharyya distance between $\mu$ and $\pi$, $\mu \neq \pi$, which is defined as

$$B(\mu, \pi) \triangleq -\log \Big( \sum_{y \in \mathcal{Y}} \mu(y)^{1/2} \pi(y)^{1/2} \Big).$$

Furthermore, the error exponent in (11) is achievable by a test that uses only the knowledge of $\pi$. In particular,
such a test is our proposed test in (7), (8).

Remark 1. It is interesting to note that when only $\mu$ is known, one can also achieve the optimal error exponent
in (11). However, we do not yet know if the corresponding version of our proposed test, wherein the $\pi$ in (4) is
replaced with $\hat{\pi}_i = \frac{\sum_{j \neq i} \gamma_j}{M-1}$, $i = 1, \ldots, M$, is optimal. Nevertheless, a different test will be presented in Appendix
[F] and will be shown to achieve the optimal error exponent in (11).

Consequently, in the completely universal setting, when nothing is known about $\mu$ and $\pi$ except that $\mu \neq \pi$, and
both $\mu$ and $\pi$ have full supports, it holds that for any universal test $\delta$,

$$\alpha(\delta, (\mu, \pi)) \leq 2B(\mu, \pi). \tag{12}$$

Notwithstanding the result in Theorem 1, without knowing either $\mu$ or $\pi$, it is not clear at the outset that we can
design a universal test $\delta$ that yields $\alpha(\delta, (\mu, \pi)) > 0$ for every $\mu, \pi$, $\mu \neq \pi$. One of our main contributions in this
paper is that our proposed universal test in (9) and (10) is indeed universally exponentially consistent. We also
characterize the error exponent achievable by our proposed universal test.



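Theorem 2. Our proposed universal test $\delta$ in (9) and (10) is universally exponentially consistent. Furthermore, for
every $\mu, \pi \in \mathcal{P}(\mathcal{Y})$, $\mu \neq \pi$, it holds that

$$\alpha(\delta, (\mu, \pi)) = \min_{q_1, \ldots, q_M} D(q_1 \| \mu) + D(q_2 \| \pi) + \cdots + D(q_M \| \pi), \tag{13}$$

where the minimum above is over the set of $(q_1, \ldots, q_M)$ such that

$$\sum_{j \neq 1} D\Big( q_j \Big\| \tfrac{\sum_{k \neq 1} q_k}{M-1} \Big) \geq \sum_{j \neq 2} D\Big( q_j \Big\| \tfrac{\sum_{k \neq 2} q_k}{M-1} \Big). \tag{14}$$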



Note that for any fixed $M \geq 3$, $\epsilon > 0$, regardless of which coordinate is the outlier, it holds that the random
empirical distributions $(\gamma_1, \ldots, \gamma_M)$ satisfy

$$\lim_{n \to \infty} \mathbb{P}_i \Big\{ \Big\| \frac{1}{M} \sum_{j=1}^{M} \gamma_j - \Big( \frac{1}{M} \mu + \frac{M-1}{M} \pi \Big) \Big\|_1 > \epsilon \Big\} = 0, \tag{15}$$

where $\mathbb{P}_i$ denotes probability when the $i$-th coordinate is the outlier, and $\| \cdot \|_1$ denotes the 1-norm of the argument
distribution. Since $\frac{1}{M} \mu + \frac{M-1}{M} \pi \to \pi$ as $M \to \infty$, heuristically
speaking, a consistent estimate of the typical distribution can readily be obtained asymptotically in $M$ at the outset
from the entire observations before deciding upon which coordinate is the outlier. This observation and the second
assertion of Theorem 1 motivate our study of the asymptotic performance of our proposed universal test in (9),
(10) when $M \to \infty$.

Our last result in this section shows that in the completely universal setting, as $M \to \infty$, our proposed universal
test in (9), (10) achieves the optimal error exponent in (11) corresponding to the case in which both $\mu$ and $\pi$ are
known.

Theorem 3. For each $M \geq 3$, the exponent for the maximal error probability achievable by our proposed universal
test $\delta$ in (9), (10) is lower bounded by

$$\min_{q \in \mathcal{P}(\mathcal{Y}) :\; D(q \| \pi) \leq \frac{1}{M-1} ( 2B(\mu, \pi) + C_\pi )} 2B(\mu, q), \tag{16}$$

where $C_\pi \triangleq -\log \big( \min_{y \in \mathcal{Y}} \pi(y) \big) < \infty$ by the fact that $\pi$ has a full support.

The lower bound for the error exponent in (16) is nondecreasing in $M \geq 3$. Furthermore, as $M \to \infty$, this
lower bound converges to the optimal error exponent $2B(\mu, \pi)$; hence, it holds that

$$\lim_{M \to \infty} \alpha(\delta, (\mu, \pi)) = 2B(\mu, \pi). \tag{17}$$



V. Numerical Results 

We now provide some numerical results for an example with $\mathcal{Y} = \{0, 1\}$. Specifically, the three plots in the figure
below are for three pairs of outlier and typical distributions: $\mu = (\mu(0) = 0.3, \mu(1) = 0.7)$, $\pi = (0.7, 0.3)$; $\mu =
(0.35, 0.65)$, $\pi = (0.65, 0.35)$; and $\mu = (0.4, 0.6)$, $\pi = (0.6, 0.4)$, respectively. Each horizontal line corresponds
to $2B(\mu, \pi)$, and each curve corresponds to the lower bound in (16) for the error exponent achievable by our
proposed universal test. As shown in these plots, the lower bounds converge to $2B(\mu, \pi)$ as $M \to \infty$, i.e., our
proposed universal test is asymptotically optimal for all three pairs $\mu, \pi$.
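For a binary alphabet, the lower bound in (16) can be approximated by a one-dimensional grid search, which gives a rough way to reproduce curves of this kind. The sketch below is an assumption-laden illustration: the function names, the grid resolution, and the particular values of $M$ are hypothetical, and a grid search only approximates the true minimum.

```python
import math


def bhattacharyya(p, q):
    """Bhattacharyya distance B(p, q) in nats for pmfs given as tuples."""
    return -math.log(sum(math.sqrt(a * b) for a, b in zip(p, q)))


def kl(q, p):
    """Relative entropy D(q || p) in nats, with the convention 0 log 0 = 0."""
    return sum(a * math.log(a / b) for a, b in zip(q, p) if a > 0)


def lower_bound(mu, pi, M, grid=20000):
    """Grid-search approximation of the bound (16) for a binary alphabet.

    Minimizes 2 B(mu, q) over q = (t, 1 - t) on a uniform grid, subject to
    D(q || pi) <= (2 B(mu, pi) + C_pi) / (M - 1).
    """
    c_pi = -math.log(min(pi))
    radius = (2 * bhattacharyya(mu, pi) + c_pi) / (M - 1)
    best = math.inf
    for k in range(grid + 1):
        q = (k / grid, 1 - k / grid)
        if kl(q, pi) <= radius:
            best = min(best, 2 * bhattacharyya(mu, q))
    return best


# The three pairs (mu, pi) from the numerical example in the text.
pairs = [
    ((0.3, 0.7), (0.7, 0.3)),
    ((0.35, 0.65), (0.65, 0.35)),
    ((0.4, 0.6), (0.6, 0.4)),
]
Ms = [3, 6, 10, 50, 200, 10001]  # illustrative choices of M
optimal = [2 * bhattacharyya(mu, pi) for mu, pi in pairs]
bounds = [[lower_bound(mu, pi, M) for M in Ms] for mu, pi in pairs]
```

Each row of `bounds` can be compared against the corresponding entry of `optimal`. For small $M$ the constraint set may contain $\mu$ itself, in which case the bound (16) is zero and the positivity of the exponent rests on Theorem 2 rather than on this bound.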
VI. Extensions to the Case with Multiple Outliers

In this section, we extend our results to the case with $T$ outliers for any fixed $T \geq 1$. In particular, for large $M$,
there will be $\binom{M}{T}$ hypotheses instead of just $M$ hypotheses as in the case with only one outlier. Each hypothesis
corresponds to the situation when all coordinates in a subset $S \subset \{1, \ldots, M\}$ of size $T$ are outlier coordinates.
For the hypothesis corresponding to such a subset $S$, the joint distribution of all the observations is

$$p_S\big(y^{Mn}\big) = \prod_{k=1}^{n} \Big\{ \prod_{i \in S} \mu\big(y_k^{(i)}\big) \prod_{j \notin S} \pi\big(y_k^{(j)}\big) \Big\}.$$

The test $\delta$ will now map every possible sequence of observations to a subset of $\{1, \ldots, M\}$ of size $T$. The
corresponding exponent for the maximal error probability is now defined as

$$\alpha(\delta, (\mu, \pi)) \triangleq \lim_{n \to \infty} -\frac{1}{n} \log \Big( \max_{S \subset \{1, \ldots, M\},\, |S| = T} \mathbb{P}_S \{ \delta \neq S \} \Big).$$

Motivated by the single outlier test in (7), (8), when only $\pi$ is known and $\mu$ is unknown, we shall adopt the test
that selects the hypothesis that yields the minimum

$$U_S^{\mathrm{typ}}\big(y^{Mn}\big) \triangleq \sum_{j \notin S} D(\gamma_j \| \pi) \tag{18}$$

among all possible outlier hypotheses each of which is indexed by a subset $S \subset \{1, \ldots, M\}$ with $|S| = T$.

Similarly, when neither $\mu$ nor $\pi$ is known, motivated by the single outlier universal test in (9), (10), our test
selects the hypothesis that yields the minimum

$$U_S^{\mathrm{univ}}\big(y^{Mn}\big) \triangleq \sum_{j \notin S} D\Big( \gamma_j \Big\| \tfrac{\sum_{k \notin S} \gamma_k}{M-T} \Big) \tag{19}$$

among all the outlier hypotheses.
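A brute-force version of the multiple-outlier universal test in (19) simply enumerates all $\binom{M}{T}$ candidate subsets. The Python sketch below assumes the empirical distributions are already available as tuples; the function names and the example data are hypothetical, and exhaustive enumeration is chosen here only for clarity, since the text does not prescribe a search strategy.

```python
import math
from itertools import combinations


def kl(q, p):
    """Relative entropy D(q || p) in nats, with the convention 0 log 0 = 0."""
    return sum(a * math.log(a / b) for a, b in zip(q, p) if a > 0)


def multi_outlier_statistic(gammas, subset):
    """U_S^univ in (19) for a candidate outlier subset S (0-based indices)."""
    rest = [g for j, g in enumerate(gammas) if j not in subset]
    avg = tuple(sum(col) / len(rest) for col in zip(*rest))
    return sum(kl(g, avg) for g in rest)


def multi_outlier_test(gammas, T):
    """Return the size-T subset (as a set of 0-based indices) minimizing U_S^univ."""
    best_subset, best_value = None, math.inf
    for subset in combinations(range(len(gammas)), T):
        value = multi_outlier_statistic(gammas, subset)
        if value < best_value:
            best_subset, best_value = set(subset), value
    return best_subset


# Idealized empirical distributions (assumed): coordinates 1 and 4 are outliers.
typical, outlying = (0.7, 0.3), (0.3, 0.7)
gammas = [typical, outlying, typical, typical, outlying, typical]
detected = multi_outlier_test(gammas, T=2)
```

With $T = 1$ this reduces to the single-outlier statistic in (10). The cost grows as $\binom{M}{T}$, so for large $M$ and $T$ a smarter search would be needed.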

When $\mu$ and $\pi$ are known and when only $\pi$ is known, the (fixed) number of outliers does not affect the optimal
error exponent.

Theorem 4. For every fixed number $T \geq 1$ of outliers, and every large $M$, when $\mu$ and $\pi$ are both known, the
optimal exponent for the maximal error probability is equal to

$$2B(\mu, \pi). \tag{20}$$

This error exponent is achievable by a test that uses only the knowledge of $\pi$. In particular, such a test is our
proposed test in (18).



Our last result generalizes Theorem 3 to the case with multiple outliers via the use of our universal test in (19).



Theorem 5. For every fixed number $T \geq 1$ of outliers, and every large $M$, our proposed universal test in (19)
is universally exponentially consistent. In particular, the exponent for the maximal error probability achievable by
our universal test is lower bounded by

$$\min_{q \in \mathcal{P}(\mathcal{Y}) :\; D(q \| \pi) \leq \frac{1}{M-T} ( 2B(\mu, \pi) + C_\pi )} 2B(\mu, q), \tag{21}$$

where $C_\pi = -\log \big( \min_{y \in \mathcal{Y}} \pi(y) \big) < \infty$.

The lower bound for the error exponent in (21) is nondecreasing in $M$, and converges to the optimal error
exponent $2B(\mu, \pi)$. Consequently, it holds that

$$\lim_{M \to \infty} \alpha(\delta, (\mu, \pi)) = 2B(\mu, \pi). \tag{22}$$

VII. Conclusion 

In this paper, we formulated and studied the problem of outlier hypothesis testing in a completely universal 
setting. Our main contribution is a universal test that yields exponentially decaying probability of error for every 
hypothesis regarding the position of the outlier. The main idea behind the test was to apply the generalized likelihood 
principle to a completely universal setting while taking advantage of the fact that there is exactly one outlier among 
the M coordinates. We also provided a characterization of the achievable error exponent of our test for every 
$M \geq 3$. Surprisingly, our test is not only universally exponentially consistent, but also
asymptotically efficient as the number of coordinates goes to infinity. In particular, as M goes to infinity, the error 
exponent achieved by our universal test converges to the error exponent achieved by the optimal test when both 
the outlier and typical distributions are known. These appealing properties of our universal test, i.e., the universally 
exponential consistency and the asymptotic efficiency, were also shown to hold in the case with more than one
outlier coordinate wherein the number of outliers is fixed and known.

The results in this paper suggest a new approach to a number of applications including environment monitoring 
in sensor networks, fraud detection and anomaly detection. For example, in detecting credit card fraud, a common 
practice [5] is to keep track of the transaction information of every individual customer over a period of time. A
suspicion score is assigned to every customer by comparing the customer's current purchase with previous purchases 
and some standard expected usage patterns. An alarm is triggered if the suspicion score exceeds a certain threshold. 
This method may suffer from a high false alarm rate due to the fact that a decision is made without taking into 
account the changes in the customer's life, such as graduation, new job, marriage, etc., which may prompt changes 
in the customer's purchase pattern. The results in our paper suggest that it may be beneficial to approach credit 
card fraud detection from a different angle. Specifically, instead of comparing a customer's current transaction 
with previous purchases, we could compare that customer's transaction with a group of customers who share the 
same purchase pattern. An unusual but non-fraudulent transaction, which is possibly triggered by a special incident,
may not be classified as suspicious when examined against a group of customers. Our results also suggest that the
accuracy of such a fraud detection mechanism can be quite close to ideal if a group with a sufficiently large number 
of customers is considered. 

It is interesting to note that our results rely critically on the assumption that the number of outlier coordinates 
is known exactly, an assumption that is well justified in some applications as explained in Section I. For example,
in the case with only one outlier, if only one additional hypothesis corresponding to the situation with no outlier 
is present, then the nature of the problem changes completely. In particular, for this new setup, it can be shown 
that there cannot exist any universally exponentially consistent test even when the typical distribution is known. 
The same pessimistic result holds when we look at the situation with any number (non-zero) of outliers up to T. 
Specifically, the feature that the supports of the outlier coordinates across various hypotheses do not subsume one
another makes it possible to construct an efficient universal test. As a consequence, in order to fully exploit the 
merits of our test in a particular application, it is essential that the exact number of outliers be given as a prior for 
hypothesis testing. 

We end with a discussion of possible extensions of our results. First, it is worth noting that although efficient in 
many cases, generalized likelihood tests fall short of optimality in some situations p8) , p9) . A different approach, 
namely, the "competitive minimax" approach, proposed by Feder and Merhav, is aimed at minimizing the worst-case 
ratio between the probability of error of a universal test and the minimum probability of error when the underlying 
distributions are fully known [18]. A similar "competitive optimality" framework was adopted in pO| to study the
sample complexity of classifiers, wherein to achieve a fixed error probability, the goal was to minimize the ratio 
between the number of samples needed by a universal classifier, not knowing the underlying distributions, and 
that needed by the optimal classifier with the knowledge of those distributions. Under such competitive minimax 
performance criteria, it is interesting to see what the structure and performance of an optimal test are in the universal 
outlier hypothesis testing problem. Another interesting way to extend the results of this paper would be to consider 
models with the size of the alphabet being large compared to the number of samples from each coordinate [19],
pT| . Such a situation is usually formulated as one in which the alphabet is allowed to grow with the number of 
samples. For our universal outlier hypothesis testing problem, a natural question arises as to how fast the alphabet 
can grow so that there still exists a test that is universally exponentially consistent. Another possible extension 
would be to generalize the results in this paper to the case with abstract observation alphabets. 
Appendix

Our proofs rely on the following two lemmas, the first of which is an extension of Sanov's theorem.

Lemma 1. Let $Y^{(1)}, \ldots, Y^{(J)}$ be mutually independent random vectors with each $Y^{(j)}$, $j = 1, \ldots, J$, being $n$
i.i.d. repetitions of a random variable distributed according to $p_j \in \mathcal{P}(\mathcal{Y})$ with a full support. Let $A_n$ be the set of
all $J$-tuples $(y^{(1)}, \ldots, y^{(J)}) \in \mathcal{Y}^{Jn}$ whose empirical distributions $(\gamma_1, \ldots, \gamma_J) = (\gamma_{y^{(1)}}, \ldots, \gamma_{y^{(J)}})$ lie in a closed
set $E \subseteq \mathcal{P}(\mathcal{Y})^J$. Then, it holds that

$$\lim_{n \to \infty} -\frac{1}{n} \log \mathbb{P} \Big\{ \big( Y^{(1)}, \ldots, Y^{(J)} \big) \in A_n \Big\} = \min_{(q_1, \ldots, q_J) \in E} \sum_{j=1}^{J} D(q_j \| p_j).$$

Proof. We start with some well-known identities that will be useful in the proof of Lemma 1.
For any finite set $\mathcal{X}$, let $\mathcal{T}^m(\mathcal{X})$ be the set of all possible empirical distributions of all sequences in $\mathcal{X}^m$. For
any $\gamma^* \in \mathcal{T}^m(\mathcal{X})$, let $T_{\gamma^*} = \{ x \in \mathcal{X}^m : \gamma_x = \gamma^* \}$. Then, it can be shown (see, e.g., [17]) that

$$\frac{\exp\{ m H(\gamma^*) \}}{(m+1)^{|\mathcal{X}|}} \leq |T_{\gamma^*}| \leq \exp\{ m H(\gamma^*) \}, \tag{23}$$

and

$$|\mathcal{T}^m(\mathcal{X})| \leq (m+1)^{|\mathcal{X}|}. \tag{24}$$
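The counting bounds (23) and (24) are easy to check exhaustively for a small alphabet and a short block length. The following Python sketch does so; the helper names and the choice $m = 12$, $|\mathcal{X}| = 3$ are assumptions made only for this illustration.

```python
import math


def type_class_size(counts):
    """Number of sequences with the given symbol counts (a multinomial coefficient)."""
    size, remaining = 1, sum(counts)
    for c in counts:
        size *= math.comb(remaining, c)
        remaining -= c
    return size


def entropy_of_counts(counts):
    """Entropy in nats of the empirical distribution counts / sum(counts)."""
    m = sum(counts)
    return -sum((c / m) * math.log(c / m) for c in counts if c > 0)


def compositions(m, k):
    """Yield all k-tuples of nonnegative integers summing to m (one per type)."""
    if k == 1:
        yield (m,)
        return
    for first in range(m + 1):
        for rest in compositions(m - first, k - 1):
            yield (first,) + rest


m, k = 12, 3  # block length and alphabet size (illustrative)
types = list(compositions(m, k))

# (24): the number of types is at most (m + 1)^|X|.
number_of_types_ok = len(types) <= (m + 1) ** k

# (23): lower and upper bounds on the size of every type class.
type_class_bounds_ok = all(
    math.exp(m * entropy_of_counts(c)) / (m + 1) ** k
    <= type_class_size(c)
    <= math.exp(m * entropy_of_counts(c)) * (1 + 1e-9)  # slack for rounding
    for c in types
)

# The type classes partition the set of all sequences of length m.
total_sequences = sum(type_class_size(c) for c in types)
```

The final sum counts every sequence in $\mathcal{X}^m$ exactly once, which is the fact used later when a probability is written as a sum over types.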



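For random variables $X^m$ which are i.i.d. according to $p \in \mathcal{P}(\mathcal{X})$, it now follows from (1) and (23) that for every
$\gamma^* \in \mathcal{T}^m(\mathcal{X})$,

$$\frac{\exp\{ -m D(\gamma^* \| p) \}}{(m+1)^{|\mathcal{X}|}} \leq \sum_{x \in T_{\gamma^*}} p(x) \leq \exp\{ -m D(\gamma^* \| p) \}. \tag{25}$$

Now consider the following set

$$E_n \triangleq E \cap \Big\{ \big( \gamma_{y^{(1)}}, \ldots, \gamma_{y^{(J)}} \big) \;\Big|\; \big( y^{(1)}, \ldots, y^{(J)} \big) \in \mathcal{Y}^{Jn} \Big\}.$$

Using (24), we get from the definition of $E_n$ that

$$|E_n| \leq (n+1)^{J |\mathcal{Y}|}. \tag{26}$$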
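For any $(q_1, \ldots, q_J) \in E_n$, the probability under $(p_1, \ldots, p_J)$ of the set of tuples of independent sequences in
$T_{q_1} \times \cdots \times T_{q_J}$ can be bounded using (25) as

$$\frac{\exp\big( -n \sum_{j=1}^{J} D(q_j \| p_j) \big)}{(n+1)^{J |\mathcal{Y}|}} \leq \prod_{j=1}^{J} \Big( \sum_{x \in T_{q_j}} p_j(x) \Big) \leq \exp\Big( -n \sum_{j=1}^{J} D(q_j \| p_j) \Big). \tag{27}$$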



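Note that, for each $n$,

$$\mathbb{P} \Big\{ \big( Y^{(1)}, \ldots, Y^{(J)} \big) \in A_n \Big\} = \sum_{(q_1, \ldots, q_J) \in E_n} \prod_{j=1}^{J} \Big( \sum_{x \in T_{q_j}} p_j(x) \Big).$$

Using (26) and (27), we get that

$$\begin{aligned}
\mathbb{P} \Big\{ \big( Y^{(1)}, \ldots, Y^{(J)} \big) \in A_n \Big\}
&\leq (n+1)^{J |\mathcal{Y}|} \max_{(q_1, \ldots, q_J) \in E_n} \prod_{j=1}^{J} \Big( \sum_{x \in T_{q_j}} p_j(x) \Big) \\
&\leq (n+1)^{J |\mathcal{Y}|} \max_{(q_1, \ldots, q_J) \in E_n} \exp\Big( -n \sum_{j=1}^{J} D(q_j \| p_j) \Big), && (28)
\end{aligned}$$

and

$$\begin{aligned}
\mathbb{P} \Big\{ \big( Y^{(1)}, \ldots, Y^{(J)} \big) \in A_n \Big\}
&\geq \max_{(q_1, \ldots, q_J) \in E_n} \prod_{j=1}^{J} \Big( \sum_{x \in T_{q_j}} p_j(x) \Big) \\
&\geq \max_{(q_1, \ldots, q_J) \in E_n} \frac{\exp\big( -n \sum_{j=1}^{J} D(q_j \| p_j) \big)}{(n+1)^{J |\mathcal{Y}|}}. && (29)
\end{aligned}$$

From (28), (29) and the fact that $\cup_n E_n$ is dense in $E$ as $n \to \infty$, it follows that

$$\lim_{n \to \infty} -\frac{1}{n} \log \mathbb{P} \Big\{ \big( Y^{(1)}, \ldots, Y^{(J)} \big) \in A_n \Big\} = \min_{(q_1, \ldots, q_J) \in E} \sum_{j=1}^{J} D(q_j \| p_j).$$

□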
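Lemma 2. For any two pmfs $p_1, p_2 \in \mathcal{P}(\mathcal{Y})$ with full supports, it holds that

$$2B(p_1, p_2) = \min_{q \in \mathcal{P}(\mathcal{Y})} D(q \| p_1) + D(q \| p_2). \tag{30}$$

In particular, the minimum on the right side of (30) is achieved by

$$q^*(y) = \frac{p_1^{1/2}(y)\, p_2^{1/2}(y)}{\sum_{y' \in \mathcal{Y}} p_1^{1/2}(y')\, p_2^{1/2}(y')}, \quad y \in \mathcal{Y}. \tag{31}$$

Proof. It follows from the concavity of the logarithm function that

$$\begin{aligned}
D(q \| p_1) + D(q \| p_2)
&= \sum_{y \in \mathcal{Y}} q(y) \log \frac{q^2(y)}{p_1(y)\, p_2(y)} \\
&= -2 \sum_{y \in \mathcal{Y}} q(y) \log \frac{p_1^{1/2}(y)\, p_2^{1/2}(y)}{q(y)} \\
&\geq -2 \log \Big( \sum_{y \in \mathcal{Y}} p_1^{1/2}(y)\, p_2^{1/2}(y) \Big) && (32) \\
&= 2B(p_1, p_2).
\end{aligned}$$

In particular, equality is achieved in (32) by $q(y) = q^*(y)$ in (31). □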
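Lemma 2 can be checked numerically: the normalized geometric mean $q^*$ in (31) should attain $2B(p_1, p_2)$, and no other pmf should do better. The Python sketch below performs this check on an assumed pair of pmfs with randomly drawn competitors; the names, the example pmfs, and the seed are illustrative and not taken from the paper.

```python
import math
import random


def kl(q, p):
    """Relative entropy D(q || p) in nats, with the convention 0 log 0 = 0."""
    return sum(a * math.log(a / b) for a, b in zip(q, p) if a > 0)


def bhattacharyya(p1, p2):
    """Bhattacharyya distance B(p1, p2) in nats."""
    return -math.log(sum(math.sqrt(a * b) for a, b in zip(p1, p2)))


def geometric_mean_pmf(p1, p2):
    """The minimizer q* in (31): normalized pointwise geometric mean of p1 and p2."""
    weights = [math.sqrt(a * b) for a, b in zip(p1, p2)]
    total = sum(weights)
    return [w / total for w in weights]


def objective(q, p1, p2):
    """The quantity minimized in (30)."""
    return kl(q, p1) + kl(q, p2)


def random_pmf(size, rng):
    """A random pmf on `size` symbols (used only to generate competitors)."""
    weights = [rng.random() for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


# Illustrative full-support pmfs (assumed).
p1 = [0.5, 0.3, 0.2]
p2 = [0.1, 0.6, 0.3]
target = 2 * bhattacharyya(p1, p2)
value_at_q_star = objective(geometric_mean_pmf(p1, p2), p1, p2)

rng = random.Random(1)
competitors = [objective(random_pmf(3, rng), p1, p2) for _ in range(1000)]
```

Here `value_at_q_star` matches `target` up to rounding, and every entry of `competitors` is at least `target`, as (30) predicts.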



A. Proof of Theorem 1

When $\mu$ and $\pi$ are known, it is clear that the optimum test is the ML one. In particular, for any $y^{Mn} =
(y^{(1)}, \ldots, y^{(M)}) \in \mathcal{Y}^{Mn}$, with $\gamma_{y^{(i)}} = \gamma_i$, $i = 1, \ldots, M$, conditioned on the $i$-th coordinate being the outlier, it
follows from (4) that

$$\frac{1}{n} \log p_i\big(y^{Mn}\big) = -\big[ H(\gamma_i) + D(\gamma_i \| \mu) \big] - \sum_{j \neq i} \big[ H(\gamma_j) + D(\gamma_j \| \pi) \big].$$



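Consequently, the ML test is

$$\delta\big(y^{Mn}\big) = \arg\min_{i = 1, \ldots, M} U_i\big(y^{Mn}\big),$$

where for each $i = 1, \ldots, M$,

$$U_i\big(y^{Mn}\big) \triangleq D(\gamma_i \| \mu) + \sum_{j \neq i} D(\gamma_j \| \pi). \tag{33}$$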

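By the symmetry of the problem, it is clear that $\mathbb{P}_i\{\delta \neq i\}$ is the same for every $i = 1, \ldots, M$; hence,

$$\max_{i = 1, \ldots, M} \mathbb{P}_i\{\delta \neq i\} = \mathbb{P}_1\{\delta \neq 1\}.$$

By the fact that

$$\mathbb{P}_1\{\delta \neq 1\} = \mathbb{P}_1\Big( \bigcup_{j \neq 1} \{ U_1 \geq U_j \} \Big), \tag{34}$$

it holds that

$$\mathbb{P}_1\{U_1 \geq U_2\} \leq \mathbb{P}_1\{\delta \neq 1\} \leq \sum_{j=2}^{M} \mathbb{P}_1\{U_1 \geq U_j\}. \tag{35}$$

Next, we get from (33) that

$$\mathbb{P}_1\{U_1 \geq U_2\} = \mathbb{P}_1\big\{ D(\gamma_1 \| \mu) + D(\gamma_2 \| \pi) \geq D(\gamma_1 \| \pi) + D(\gamma_2 \| \mu) \big\}.$$

Applying Lemma 1 with $J = 2$, $p_1 = \mu$, $p_2 = \pi$,

$$E = \big\{ (q_1, q_2) : D(q_1 \| \mu) + D(q_2 \| \pi) \geq D(q_1 \| \pi) + D(q_2 \| \mu) \big\},$$

we get that the exponent for $\mathbb{P}_1\{U_1 \geq U_2\}$ is given by the value of the following optimization problem

$$\min_{\substack{q_1, q_2 \in \mathcal{P}(\mathcal{Y}) \\ D(q_1 \| \mu) + D(q_2 \| \pi) \geq D(q_1 \| \pi) + D(q_2 \| \mu)}} D(q_1 \| \mu) + D(q_2 \| \pi). \tag{36}$$

The optimization problem (36) is convex and its solution can be easily computed to be $2B(\mu, \pi)$.

By the symmetry of the problem, the exponents of $\mathbb{P}_1\{U_1 \geq U_i\}$, $i \neq 1$, are the same, i.e., for every $i = 2, \ldots, M$,

$$\lim_{n \to \infty} -\frac{1}{n} \log \mathbb{P}_1\{U_1 \geq U_i\} = 2B(\mu, \pi). \tag{37}$$

It now follows from (35) and (37) that

$$\lim_{n \to \infty} -\frac{1}{n} \log \mathbb{P}_1\{\delta \neq 1\} = 2B(\mu, \pi). \tag{38}$$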

It is now left to prove that when only $\pi$ is known, our proposed test $\delta$ in (7), (8) also achieves the error exponent
$2B(\mu, \pi)$. In particular, it follows from the same argument leading to (38) that

$$\lim_{n \to \infty} -\frac{1}{n} \log \mathbb{P}_1\{\delta \neq 1\} = \lim_{n \to \infty} -\frac{1}{n} \log \mathbb{P}_1\big\{ U_1^{\mathrm{typ}} \geq U_2^{\mathrm{typ}} \big\}. \tag{39}$$

The exponent on the right side of (39) can be computed by applying Lemma 1 with $J = 2$, $p_1 = \mu$, $p_2 = \pi$, and
(cf. (8))

$$E' = \big\{ (q_1, q_2) : D(q_2 \| \pi) \geq D(q_1 \| \pi) \big\} \tag{40}$$

to be

$$\min_{\substack{q_1, q_2 \in \mathcal{P}(\mathcal{Y}) \\ D(q_2 \| \pi) \geq D(q_1 \| \pi)}} D(q_1 \| \mu) + D(q_2 \| \pi). \tag{41}$$

The optimal value of (41) can be computed as follows:

$$\begin{aligned}
\min_{\substack{q_1, q_2 \in \mathcal{P}(\mathcal{Y}) \\ D(q_2 \| \pi) \geq D(q_1 \| \pi)}} D(q_1 \| \mu) + D(q_2 \| \pi) & && (42) \\
\geq \min_{q_1} D(q_1 \| \mu) + D(q_1 \| \pi) & && (43) \\
= 2B(\mu, \pi), & && (44)
\end{aligned}$$

where the equality in (44) follows from Lemma 2. Since the minimum in (43) is achieved by $q_1 = q^*$ in (31) with
$p_1 = \mu$, $p_2 = \pi$, and $q_1 = q_2 = q^*$ satisfy the constraint in (42), the inequality in (43) is in fact an equality.
B. Proof of Theorem 2

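When $\mu$ and $\pi$ are unknown, we adopt our universal test $\delta$ in (9) and (10). In particular, the same argument leading to (38) yields that

\[
\lim_{n \to \infty} -\frac{1}{n} \log \mathbb{P}_1\{\delta \neq 1\} = \lim_{n \to \infty} -\frac{1}{n} \log \mathbb{P}_1\{U_1^{\mathrm{univ}} \geq U_2^{\mathrm{univ}}\}. \tag{45}
\]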



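The exponent on the right-side of (45) can be computed by applying Lemma 1 with $J = M$, $p_1 = \mu$, $p_j = \pi$, $j = 2, \ldots, M$, and (cf. the definition of the universal test)

\[
E = \Big\{ (q_1, \ldots, q_M) : \sum_{j \neq 1} D\Big( q_j \Big\| \frac{\sum_{k \neq 1} q_k}{M-1} \Big) \geq \sum_{j \neq 2} D\Big( q_j \Big\| \frac{\sum_{k \neq 2} q_k}{M-1} \Big) \Big\}
\]

to be

\[
\min_{(q_1, \ldots, q_M) \in E} D(q_1 \| \mu) + D(q_2 \| \pi) + \ldots + D(q_M \| \pi). \tag{46}
\]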



Unlike the convex optimization problems (36) and (41), the optimization problem in (46) for the completely universal setting is much more complicated, and a closed-form solution is not available. However, we show that the value of (46) is strictly positive for every $\mu, \pi$, $\mu \neq \pi$. In particular, it is not hard to see that the objective function is continuous in $q_1, \ldots, q_M$ and the constraint set $E$ is compact. The claim then follows by virtue of the fact that the value of the objective function in (46) is strictly positive at every feasible $(q_1, \ldots, q_M)$: the objective vanishes only at $q_1 = \mu$, $q_2 = \ldots = q_M = \pi$, and this point is not in $E$ because the left-side of the constraint is then zero while the right-side is strictly positive when $\mu \neq \pi$. Thus, our proposed test is indeed universally exponentially consistent.
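The universal statistic appearing in the constraint set $E$ can be sketched in code. The snippet below assumes that the test statistic for coordinate $i$ is $\sum_{j \neq i} D(\gamma_j \| \frac{1}{M-1} \sum_{k \neq i} \gamma_k)$, matching the form of $E$ with empirical distributions in place of $q_1, \ldots, q_M$; the function names and data are hypothetical.

```python
import math
from collections import Counter

def empirical(seq, alphabet):
    """Empirical distribution (type) of a sequence over a finite alphabet."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in alphabet]

def kl(q, p):
    """Relative entropy D(q||p) in nats; p must be positive wherever q is."""
    return sum(a * math.log(a / b) for a, b in zip(q, p) if a > 0)

def universal_outlier(obs, alphabet):
    """Pick the coordinate whose removal makes the remaining types most alike."""
    gammas = [empirical(seq, alphabet) for seq in obs]
    m = len(gammas)
    stats = []
    for i in range(m):
        rest = [g for j, g in enumerate(gammas) if j != i]
        # Average of the other M-1 types; it dominates each of them, so the
        # relative entropies below are always finite.
        avg = [sum(col) / (m - 1) for col in zip(*rest)]
        stats.append(sum(kl(g, avg) for g in rest))
    best = min(range(m), key=lambda i: stats[i])
    return best, stats

alphabet = [0, 1]
obs = [
    [0, 0, 0, 1],  # differs from the rest
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 1],
]
best, stats = universal_outlier(obs, alphabet)
print(best, stats)
```

No knowledge of $\mu$ or $\pi$ enters the computation, which is the point of the universal setting; at least three coordinates are needed for the averages to be informative.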



C. Proof of Theorem 3

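By the continuity of the objective function on the right-side of (13) and the compactness of the constraint set (14), for each $M \geq 3$, the optimal value on the right-side of (13), denoted by $V^*$, is achieved by some $(q_1^*, \ldots, q_M^*)$. It follows from (13) and (14) that

\[
\begin{aligned}
V^* &\geq D(q_1^* \| \mu) + \sum_{j \neq 1} D(q_j^* \| \pi) - \sum_{j \neq 1} D\Big( q_j^* \Big\| \frac{\sum_{k \neq 1} q_k^*}{M-1} \Big) + \sum_{j \neq 2} D\Big( q_j^* \Big\| \frac{\sum_{k \neq 2} q_k^*}{M-1} \Big) \\
&= D(q_1^* \| \mu) + \sum_{j \neq 2} D\Big( q_j^* \Big\| \frac{\sum_{k \neq 2} q_k^*}{M-1} \Big) + \sum_{j \neq 1} \sum_{y \in \mathcal{Y}} q_j^*(y) \log \frac{\frac{1}{M-1} \sum_{k \neq 1} q_k^*(y)}{\pi(y)} \\
&= D(q_1^* \| \mu) + \sum_{j \neq 2} D\Big( q_j^* \Big\| \frac{\sum_{k \neq 2} q_k^*}{M-1} \Big) + (M-1) D\Big( \frac{\sum_{k \neq 1} q_k^*}{M-1} \Big\| \pi \Big) \\
&\geq D(q_1^* \| \mu) + D\Big( q_1^* \Big\| \frac{\sum_{k \neq 2} q_k^*}{M-1} \Big) \\
&\geq 2B\Big( \mu, \frac{\sum_{k \neq 2} q_k^*}{M-1} \Big) \\
&= 2B\Big( \mu, \frac{1}{M-1} q_1^* + \frac{M-2}{M-1} \cdot \frac{\sum_{k=3}^{M} q_k^*}{M-2} \Big),
\end{aligned} \tag{47}
\]

where the last inequality follows from Lemma 2.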
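On the other hand, it follows from (12) that the value on the right-side of (13), $V^*$, satisfies

\[
\begin{aligned}
2B(\mu, \pi) \geq V^* &= D(q_1^* \| \mu) + \sum_{j=2}^{M} D(q_j^* \| \pi) \\
&\geq \sum_{j=3}^{M} D(q_j^* \| \pi) \\
&\geq (M-2) D\Big( \frac{\sum_{k=3}^{M} q_k^*}{M-2} \Big\| \pi \Big),
\end{aligned} \tag{48}
\]

where the last inequality follows from the convexity of relative entropy.

Combining (47) and (48), we get that the value $V^*$ on the right-side of (13) is lower bounded by

\[
\min_{\substack{q_1, q \in \mathcal{P}(\mathcal{Y}) \\ (M-2) D(q \| \pi) \leq 2B(\mu, \pi)}} 2B\Big( \mu, \frac{1}{M-1} q_1 + \frac{M-2}{M-1} q \Big). \tag{49}
\]

Note that the constraint in (49) can be equally written as

\[
D(q_1 \| \pi) + (M-2) D(q \| \pi) \leq 2B(\mu, \pi) + D(q_1 \| \pi).
\]

By the convexity of relative entropy, it follows that

\[
D(q_1 \| \pi) + (M-2) D(q \| \pi) \geq (M-1) D\Big( \frac{q_1 + (M-2) q}{M-1} \Big\| \pi \Big).
\]

As a result, the optimal value of (49) is lower bounded by the optimal value of

\[
\min_{\substack{q_1, q \in \mathcal{P}(\mathcal{Y}) \\ (M-1) D\big( \frac{q_1 + (M-2) q}{M-1} \big\| \pi \big) \leq 2B(\mu, \pi) + D(q_1 \| \pi)}} 2B\Big( \mu, \frac{q_1 + (M-2) q}{M-1} \Big). \tag{50}
\]

By the fact that $\pi$ has full support, it holds that

\[
D(q_1 \| \pi) \leq -\log\Big( \min_{y \in \mathcal{Y}} \pi(y) \Big) \triangleq C_\pi < \infty. \tag{51}
\]

Proceeding from (50), by using (51), we get that the optimal value of (13) is lower bounded by

\[
\min_{\substack{q' \in \mathcal{P}(\mathcal{Y}) \\ D(q' \| \pi) \leq \frac{1}{M-1} (2B(\mu, \pi) + C_\pi)}} 2B(\mu, q'). \tag{52}
\]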
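The passage from (49) to (52) can be unpacked as follows. This is a sketch of the intermediate algebra only, assuming that the objective of (49) is $2B(\mu, \cdot)$ evaluated at a mixture of $q_1$ and $q$, and writing $C_\pi$ for $-\log \min_y \pi(y)$; the symbol $q'$ is shorthand for that mixture.

```latex
% Shorthand for the mixture inside 2B(\mu, \cdot):
q' \triangleq \frac{1}{M-1}\, q_1 + \frac{M-2}{M-1}\, q .

% Convexity of D(\cdot \| \pi), with weights 1/(M-1) and (M-2)/(M-1):
(M-1)\, D(q' \| \pi) \;\le\; D(q_1 \| \pi) + (M-2)\, D(q \| \pi) .

% The constraint (M-2) D(q \| \pi) \le 2B(\mu,\pi), together with
% D(q_1 \| \pi) \le C_\pi, then gives
(M-1)\, D(q' \| \pi) \;\le\; 2B(\mu,\pi) + D(q_1 \| \pi) \;\le\; 2B(\mu,\pi) + C_\pi .

% Hence every feasible pair (q_1, q) maps to a q' inside the relative-entropy
% ball around \pi of radius (2B(\mu,\pi) + C_\pi)/(M-1), and minimizing over
% the larger set can only decrease the value:
\min_{(q_1, q)\ \text{feasible}} 2B(\mu, q')
  \;\ge\; \min_{q' :\; D(q' \| \pi) \le \frac{1}{M-1}\,(2B(\mu,\pi) + C_\pi)} 2B(\mu, q') .
```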



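The assertion in (17) follows by virtue of the fact that for any $\mu, \pi \in \mathcal{P}(\mathcal{Y})$ with full supports, it holds that

\[
\lim_{M \to \infty} \frac{1}{M-1} \big( 2B(\mu, \pi) + C_\pi \big) = 0,
\]

so that the feasible set in (52) shrinks to $\{\pi\}$ and, by the continuity of $B(\mu, \cdot)$, the bound converges to $2B(\mu, \pi)$.

This establishes that, as $M \to \infty$, our proposed universal test is indeed asymptotically optimal.

Furthermore, for any $\mu, \pi \in \mathcal{P}(\mathcal{Y})$, $\mu \neq \pi$, the value of $\frac{1}{M-1} (2B(\mu, \pi) + C_\pi)$ is strictly decreasing with $M$. Consequently, the feasible set in (16) is nonincreasing with $M$, and, hence, the optimal value of (16) is nondecreasing with $M$.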
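The monotone behavior of this lower bound can be illustrated numerically on a binary alphabet. The sketch below assumes the bound has the form $\min 2B(\mu, q')$ over $q'$ with $D(q' \| \pi) \leq \frac{1}{M-1}(2B(\mu, \pi) + C_\pi)$, where $C_\pi = -\log \min_y \pi(y)$, and approximates the minimum by a grid search; the distributions, grid size, and function names are illustrative assumptions.

```python
import math

def kl(q, p):
    """Relative entropy D(q||p) in nats; p must be positive wherever q is."""
    return sum(a * math.log(a / b) for a, b in zip(q, p) if a > 0)

def bhattacharyya(p1, p2):
    """Bhattacharyya distance B(p1, p2) = -log sum_y sqrt(p1(y) p2(y))."""
    return -math.log(sum(math.sqrt(a * b) for a, b in zip(p1, p2)))

def lower_bound(mu, pi, m, grid=2001):
    """Grid approximation of min 2B(mu, q') over the relative-entropy ball."""
    c_pi = -math.log(min(pi))
    radius = (2 * bhattacharyya(mu, pi) + c_pi) / (m - 1)
    # Binary pmfs (t, 1 - t); pi itself is added so the set is never empty.
    ts = [k / (grid - 1) for k in range(grid)] + [pi[0]]
    feasible = [(t, 1 - t) for t in ts if kl((t, 1 - t), pi) <= radius]
    return min(2 * bhattacharyya(mu, q) for q in feasible)

mu = (0.7, 0.3)  # illustrative outlier distribution (assumption)
pi = (0.3, 0.7)  # illustrative typical distribution (assumption)

sizes = [3, 10, 100, 1000]
bounds = [lower_bound(mu, pi, m) for m in sizes]
limit = 2 * bhattacharyya(mu, pi)
for m, b in zip(sizes, bounds):
    print(m, b, limit)
```

Because the same grid is reused for every $M$, the feasible sets are nested and the computed bounds are nondecreasing in $M$, mirroring the argument above.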
D. Proof of Theorem 4

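Similar to (33), when $\mu$ and $\pi$ are both known, the ML test is

\[
\delta(y^{Mn}) = \operatorname*{argmin}_{\substack{S \subset \{1, \ldots, M\} \\ |S| = T}} U_S(y^{Mn}),
\]

where for each $S \subset \{1, \ldots, M\}$, $|S| = T$,

\[
U_S(y^{Mn}) \triangleq \sum_{i \in S} D(\gamma_i \| \mu) + \sum_{j \notin S} D(\gamma_j \| \pi). \tag{53}
\]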



As in the argument leading to (38) and by the symmetry of the problem, for each fixed $S$, to calculate the error exponent for the conditional probability of error conditioned on the hypothesis corresponding to $S$, it suffices to look at the $S'$ that will yield the smallest exponent for the probability $\mathbb{P}_S\{U_S \geq U_{S'}\}$. Now, we get from (53) that

\[
\mathbb{P}_S\{U_S \geq U_{S'}\} = \mathbb{P}_S\Big\{ \sum_{i \in S \setminus S'} D(\gamma_i \| \mu) + \sum_{j \in S' \setminus S} D(\gamma_j \| \pi) \geq \sum_{i \in S \setminus S'} D(\gamma_i \| \pi) + \sum_{j \in S' \setminus S} D(\gamma_j \| \mu) \Big\}.
\]



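Applying Lemma 1 with $J = |(S \setminus S') \cup (S' \setminus S)|$, $p_i = \mu$, $i \in S \setminus S'$, $p_j = \pi$, $j \in S' \setminus S$,

\[
E = \Big\{ \{q_i\}_{i \in S \setminus S'}, \{q_j\}_{j \in S' \setminus S} : \sum_{i \in S \setminus S'} D(q_i \| \mu) + \sum_{j \in S' \setminus S} D(q_j \| \pi) \geq \sum_{i \in S \setminus S'} D(q_i \| \pi) + \sum_{j \in S' \setminus S} D(q_j \| \mu) \Big\},
\]

we get that the exponent for $\mathbb{P}_S\{U_S \geq U_{S'}\}$ is given by the value of the following optimization problem

\[
\min_{\substack{\{q_i\}_{i \in S \setminus S'},\ \{q_j\}_{j \in S' \setminus S} : \\ \sum_{i \in S \setminus S'} D(q_i \| \mu) + \sum_{j \in S' \setminus S} D(q_j \| \pi) \\ \geq \sum_{i \in S \setminus S'} D(q_i \| \pi) + \sum_{j \in S' \setminus S} D(q_j \| \mu)}} \sum_{i \in S \setminus S'} D(q_i \| \mu) + \sum_{j \in S' \setminus S} D(q_j \| \pi). \tag{54}
\]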



The optimization problem (54) is convex and its solution can be computed to be $2 |S \setminus S'| B(\mu, \pi)$, which will be smallest for $S'$ such that $|S \setminus S'| = 1$, yielding the error exponent $2B(\mu, \pi)$.
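The subset search implied by (53) can be sketched as an exhaustive minimization over all size-$T$ subsets. This is only an illustration for small $M$, since the number of subsets grows combinatorially; the names `ml_outlier_set` and `t`, and the sample data, are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def empirical(seq, alphabet):
    """Empirical distribution (type) of a sequence over a finite alphabet."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in alphabet]

def kl(q, p):
    """Relative entropy D(q||p) in nats; p must be positive wherever q is."""
    return sum(a * math.log(a / b) for a, b in zip(q, p) if a > 0)

def ml_outlier_set(obs, mu, pi, alphabet, t):
    """Return the size-t subset (0-based indices) minimizing U_S in (53)."""
    gammas = [empirical(seq, alphabet) for seq in obs]
    d_mu = [kl(g, mu) for g in gammas]
    d_pi = [kl(g, pi) for g in gammas]

    def u(subset):
        chosen = set(subset)
        # Coordinates in S are scored against mu, all others against pi.
        return sum(d_mu[i] if i in chosen else d_pi[i]
                   for i in range(len(gammas)))

    return min(combinations(range(len(gammas)), t), key=u)

alphabet = [0, 1]
mu = [0.9, 0.1]  # illustrative outlier distribution (assumption)
pi = [0.1, 0.9]  # illustrative typical distribution (assumption)
obs = [
    [1] * 9 + [0],  # looks typical
    [0] * 9 + [1],  # looks like an outlier
    [1] * 9 + [0],  # looks typical
    [0] * 9 + [1],  # looks like an outlier
]
detected = ml_outlier_set(obs, mu, pi, alphabet, t=2)
print(detected)
```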

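It is now left to prove that when only $\pi$ is known, our proposed test $\delta'$ in (18) also achieves the error exponent $2B(\mu, \pi)$. In particular, similar to the argument leading to (38), we get that for any $S \subset \{1, \ldots, M\}$, $|S| = T$,

\[
\lim_{n \to \infty} -\frac{1}{n} \log \mathbb{P}_S\{\delta' \neq S\} = \min_{\substack{S' \subset \{1, \ldots, M\} \\ S' \neq S,\ |S'| = T}} \lim_{n \to \infty} -\frac{1}{n} \log \mathbb{P}_S\{U_S^{\mathrm{typ}} \geq U_{S'}^{\mathrm{typ}}\}. \tag{55}
\]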



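For each such $S' \neq S$, the exponent for $\mathbb{P}_S\{U_S^{\mathrm{typ}} \geq U_{S'}^{\mathrm{typ}}\}$ can be computed by applying Lemma 1 with $J = |(S' \setminus S) \cup (S \setminus S')|$, $p_i = \mu$, $i \in S \setminus S'$, $p_j = \pi$, $j \in S' \setminus S$, and (cf. the definition of $\delta'$)

\[
E' = \Big\{ \{q_i\}_{i \in S \setminus S'}, \{q_j\}_{j \in S' \setminus S} : \sum_{j \in S' \setminus S} D(q_j \| \pi) \geq \sum_{i \in S \setminus S'} D(q_i \| \pi) \Big\}
\]

to be

\[
\min_{\substack{\{q_i\}_{i \in S \setminus S'},\ \{q_j\}_{j \in S' \setminus S} : \\ \sum_{j \in S' \setminus S} D(q_j \| \pi) \geq \sum_{i \in S \setminus S'} D(q_i \| \pi)}} \sum_{i \in S \setminus S'} D(q_i \| \mu) + \sum_{j \in S' \setminus S} D(q_j \| \pi). \tag{56}
\]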



Using Lemma 2, we get that the optimal value of (56) is lower bounded by $2 |S' \setminus S| B(\mu, \pi)$. It now follows from (55) and the first assertion of this theorem proved earlier that

\[
\lim_{n \to \infty} -\frac{1}{n} \log \mathbb{P}_S\{\delta' \neq S\} = 2B(\mu, \pi).
\]



E. Proof of Theorem 5

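The proof of universally exponential consistency of our test $\delta$ in (19) is similar to the proof of Theorem 2 but with a different characterization for the achievable error exponent of our test in (19) now being

\[
\alpha(\delta, (\mu, \pi)) = \min_{q_1, \ldots, q_M} \sum_{i=1}^{T} D(q_i \| \mu) + \sum_{j=T+1}^{M} D(q_j \| \pi), \tag{57}
\]

where the minimum above is over the set of $(q_1, \ldots, q_M)$ such that

\[
\sum_{j=T+1}^{M} D\Big( q_j \Big\| \frac{\sum_{k=T+1}^{M} q_k}{M-T} \Big) \geq \min_{\substack{S' \subset \{1, \ldots, M\} \\ S' \neq \{1, \ldots, T\},\ |S'| = T}} \sum_{j \notin S'} D\Big( q_j \Big\| \frac{\sum_{k \notin S'} q_k}{M-T} \Big). \tag{58}
\]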



For each $S' \neq \{1, \ldots, T\}$, $|S'| = T$, let $V_{S'}^*$ denote the optimal value of the optimization problem in (57) but with the constraint in (58) being for just that one $S'$ on the right-side (instead of the minimization over all possible such $S'$ as in (58)), i.e.,

\[
\sum_{j=T+1}^{M} D\Big( q_j \Big\| \frac{\sum_{k=T+1}^{M} q_k}{M-T} \Big) \geq \sum_{j \notin S'} D\Big( q_j \Big\| \frac{\sum_{k \notin S'} q_k}{M-T} \Big). \tag{59}
\]



In order to prove the lower bound in (21), it now suffices to prove that for each such $S'$, $V_{S'}^*$ is lower bounded by (21). We shall establish this for just one $S' = \{1, \ldots, T-1, T+1\}$. The proof for the other $S'$ follows in a very similar manner.

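For $S' = \{1, \ldots, T-1, T+1\}$, say $V_{S'}^*$ is achieved by some $(q_1^*, \ldots, q_M^*)$. Then, we get from (57), (59) that

\[
\begin{aligned}
V_{S'}^* &\geq \sum_{i=1}^{T} D(q_i^* \| \mu) + \sum_{j=T+1}^{M} D(q_j^* \| \pi) - \sum_{j=T+1}^{M} D\Big( q_j^* \Big\| \frac{\sum_{k=T+1}^{M} q_k^*}{M-T} \Big) + \sum_{j \notin S'} D\Big( q_j^* \Big\| \frac{\sum_{k \notin S'} q_k^*}{M-T} \Big) \\
&= \sum_{i=1}^{T} D(q_i^* \| \mu) + \sum_{j \notin S'} D\Big( q_j^* \Big\| \frac{\sum_{k \notin S'} q_k^*}{M-T} \Big) + (M-T) D\Big( \frac{\sum_{k=T+1}^{M} q_k^*}{M-T} \Big\| \pi \Big) \\
&\geq D(q_T^* \| \mu) + D\Big( q_T^* \Big\| \frac{\sum_{k \notin S'} q_k^*}{M-T} \Big) \\
&\geq 2B\Big( \mu, \frac{1}{M-T} q_T^* + \Big( \frac{M-T-1}{M-T} \Big) \frac{\sum_{k=T+2}^{M} q_k^*}{M-T-1} \Big).
\end{aligned} \tag{60}
\]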



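On the other hand, it follows from the first assertion of Theorem 4 that $V_{S'}^*$ satisfies

\[
\begin{aligned}
2B(\mu, \pi) \geq V_{S'}^* &= \sum_{i=1}^{T} D(q_i^* \| \mu) + \sum_{j=T+1}^{M} D(q_j^* \| \pi) \\
&\geq \sum_{j=T+2}^{M} D(q_j^* \| \pi) \\
&\geq (M-T-1) D\Big( \frac{\sum_{k=T+2}^{M} q_k^*}{M-T-1} \Big\| \pi \Big).
\end{aligned} \tag{61}
\]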
The proof of the lower bound in (21) now follows from (60), (61) in a manner that is very close to the argument leading to (52) and the rest of the proof of Theorem 3.

F. Optimal Test for One Outlier When Only $\mu$ Is Known

Now we address the issue raised in Remark 1. In particular, when only $\mu$ is known, instead of using the corresponding version of the test proposed in Section III, we adopt the following "pairwise" test $\delta$. Recall that with $y^{(i)} = (y_1^{(i)}, \ldots, y_n^{(i)})$, we denote the empirical distribution of $y^{(i)}$ by $\gamma_i$, $i = 1, \ldots, M$. Given the entire observations, the new test declares the $i$-th coordinate to be the outlier if it holds that for every $j \neq i$, $j = 1, \ldots, M$,

\[
D(\gamma_i \| \mu) < D(\gamma_j \| \mu). \tag{62}
\]

When there is no such $i$ satisfying (62), the detector outputs any fixed coordinate, say, coordinate 1.
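The pairwise rule in (62), including its fallback, can be sketched as follows. The function name `pairwise_outlier`, the 0-based indexing, and the sample data are illustrative assumptions; the default index 0 plays the role of "coordinate 1" in the text.

```python
import math
from collections import Counter

def empirical(seq, alphabet):
    """Empirical distribution (type) of a sequence over a finite alphabet."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in alphabet]

def kl(q, p):
    """Relative entropy D(q||p) in nats; p must be positive wherever q is."""
    return sum(a * math.log(a / b) for a, b in zip(q, p) if a > 0)

def pairwise_outlier(obs, mu, alphabet, default=0):
    """Declare i if D(gamma_i||mu) is strictly smallest, else return default."""
    d = [kl(empirical(seq, alphabet), mu) for seq in obs]
    for i in range(len(d)):
        if all(d[i] < d[j] for j in range(len(d)) if j != i):
            return i
    return default  # no strict minimizer: fall back to a fixed coordinate

alphabet = [0, 1]
mu = [0.9, 0.1]  # illustrative known outlier distribution (assumption)
obs = [
    [1] * 9 + [0],
    [1] * 8 + [0] * 2,
    [0] * 9 + [1],  # closest to mu
]
print(pairwise_outlier(obs, mu, alphabet))

# With identical coordinates there is no strict minimizer, so the default is used.
tied = [[0, 1], [0, 1], [0, 1]]
print(pairwise_outlier(tied, mu, alphabet))
```

Only $\mu$ is used by the rule; the typical distribution $\pi$ never enters the computation.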



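It now follows using (62) that

\[
\mathbb{P}_1\{\delta \neq 1\} \leq (M-1)\, \mathbb{P}_1\{ D(\gamma_1 \| \mu) \geq D(\gamma_2 \| \mu) \}.
\]

Applying Lemma 1 with $J = 2$, $p_1 = \mu$, $p_2 = \pi$,

\[
E = \{ (q_1, q_2) : D(q_1 \| \mu) \geq D(q_2 \| \mu) \},
\]

we get that the exponent for $\mathbb{P}_1\{\delta \neq 1\}$ is given by the value of the following optimization problem

\[
\begin{aligned}
\min_{\substack{q_1, q_2 \in \mathcal{P}(\mathcal{Y}) \\ D(q_1 \| \mu) \geq D(q_2 \| \mu)}} D(q_1 \| \mu) + D(q_2 \| \pi) &\geq \min_{q_2} D(q_2 \| \mu) + D(q_2 \| \pi) \\
&= 2B(\mu, \pi).
\end{aligned}
\]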